CoPheScan: phenome-wide association studies accounting for linkage disequilibrium

Phenome-wide association studies (PheWAS) facilitate the discovery of associations between a single genetic variant and multiple phenotypes. For variants that impact a specific protein, this can help identify additional therapeutic indications or on-target side effects of intervening on that protein. However, PheWAS is restricted by an inability to distinguish confounding due to linkage disequilibrium (LD) from true pleiotropy. Here we describe CoPheScan (Coloc adapted Phenome-wide Scan), a Bayesian approach that enables an intuitive and systematic exploration of causal associations while simultaneously addressing LD confounding. We demonstrate its performance through simulation, showing considerably better control of false positive rates than a conventional approach that does not account for LD. We used CoPheScan to perform PheWAS of protein-truncating variants and fine-mapped variants from disease and pQTL studies in 2275 disease phenotypes from the UK Biobank. Our results highlight the complexity of known pleiotropic genes such as APOE and suggest a new causal role for TGM3 in skin cancer.


Main
Phenome-wide association studies (PheWAS) are an inversion of the genome-wide association study (GWAS) paradigm, in which a single genetic variant is tested against a broad range of phenotypes. Phenome-scale studies are facilitated by the availability of a broad array of phenotypes linked to genomic data in large-scale biobanks. PheWAS are a promising tool in pharmacogenomics because their ability to detect pleiotropy facilitates drug repurposing and the identification of potential adverse effects [1-3]. PheWAS has often been paired with other approaches, such as Mendelian randomisation to identify causal effects of exposures on outcomes, and network analysis to identify interactions between phenotypes [4-6].
Prevailing methods for phenome-wide testing are built upon single-variant tests and do not inherently tackle the spurious associations that can arise when traits are causally associated not with the index variant but with another variant in LD with it. For instance, a PheWAS of UK Biobank phenotypes with protein-truncating variants by DeBoever et al. [7] initially revealed an association between an ANKDD1B variant and high cholesterol, which was found to reflect an indirect association through LD with an intronic variant in HMGCR that is known to be associated with cholesterol levels. LD confounding thus necessitates additional follow-up tests such as colocalisation analyses, in which pairs of traits are tested for shared causal variants within a genomic region, to isolate associations that are truly causal [3,8]. PheWAS hits are colocalised with molecular QTLs or disease traits on which the identified variants have a previously known effect. However, this two-step approach is not feasible for variants with known biological effects for which summary statistics are unavailable, such as those involved in protein truncation.
In this work, we introduce a Bayesian approach to PheWAS, Coloc adapted Phenome-wide Scan (CoPheScan), that tests phenome-scale causal associations with a set of index variants while simultaneously handling confounding due to LD. CoPheScan can exploit external covariate data, such as the genetic correlation between phenotypes, and can be run in different ways depending on whether accurate LD information is available and whether the analyst is prepared to make assumptions about the number of causal variants in the tested genomic region. We demonstrate the utility and robustness of these different approaches on simulated datasets. We also used CoPheScan to analyse causal variants selected from three real-world sources, testing for causal associations against 2275 phenotypes from the UK Biobank.

Overview of CoPheScan
CoPheScan is an adaptation of the coloc [9-11] approach for the case where a variant known to be causal, either through fine-mapping or functional studies, is subjected to a phenome-wide scan to test for causal associations with other phenotypes or traits. Coloc considers the genetic association patterns for two traits in a genomic region and assesses whether it is likely that they share a causal variant in that region. It is a Bayesian approach and assumes that prior probabilities for each of the five possible hypotheses (no association with either trait, association with just one trait or the other, association with both traits at different causal SNPs, or association with both traits at the same causal SNP) are fixed and known. We consider the case where a SNP of interest is known to be causal for a phenotype, which is often the case in PheWAS, and we are interested in determining whether it is also causally associated with another phenotype (Figure 1a). We will hereafter refer to the variant of interest as the query variant, the phenotype with which the query variant is known to be causally associated as the primary trait, and the phenotype to be tested as the query trait. In a genomic region with Q SNPs, and under the initial assumption of a single causal variant (which we will relax later), there are Q+1 possible configurations, each corresponding to exactly one of three hypotheses:
Hn: no association with the query trait (one configuration)
Ha: causal association of a variant other than the query variant with the query trait (Q-1 configurations)
Hc: causal association of the query variant with the query trait (one configuration)
The posterior odds for each hypothesis (H) given the data (D) for the query trait, with respect to the null hypothesis (Hn), are given by
$$\frac{P(H \mid D)}{P(H_n \mid D)} = \frac{P(H)}{P(H_n)} \times \frac{P(D \mid H)}{P(D \mid H_n)} \tag{1}$$
In equation (1), the first ratio on the right-hand side is the prior odds and the second ratio is the Bayes factor (BF). Thus, the three prior probabilities that have to be specified are pn = P(Hn), pa (the prior probability that a given non-query variant is causal, i.e. of each single configuration under Ha) and pc = P(Hc), subject to the constraint that pn + (Q-1)pa + pc = 1. Here, when many query variant/query trait pairs are being tested, we allow these priors to be learnt from the data using a hierarchical model, in contrast to the fixed prior approach of coloc, although fixed values can also be supplied (Supplementary methods). We also allow covariate information, such as the genetic correlation (rg) between primary and query traits, to be exploited by adapting the prior formulation to depend on covariates (see Supplementary methods).
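As an illustration of equation (1) under the single causal variant assumption, the posterior probabilities of the three hypotheses can be sketched as follows. This is a minimal sketch and not the CoPheScan implementation: the function name `posterior_hypotheses` and its arguments are hypothetical, per-SNP log Bayes factors against the null are assumed to be already available, and the prior values in the usage lines are purely illustrative.

```python
import math


def posterior_hypotheses(log_bf, query_index, p_a, p_c):
    """Posterior probabilities of Hn, Ha and Hc for one query trait.

    log_bf      : per-SNP log Bayes factors versus the null (length Q)
    query_index : position of the query variant within log_bf
    p_a, p_c    : prior that a given non-query / the query variant is causal
    """
    q = len(log_bf)
    p_n = 1.0 - (q - 1) * p_a - p_c  # constraint: pn + (Q-1)pa + pc = 1
    if p_n <= 0.0:
        raise ValueError("priors leave no probability mass for Hn")

    # log(prior x BF) for each of the Q+1 configurations; the null has BF = 1.
    log_terms = [math.log(p_n)]
    for i, lbf in enumerate(log_bf):
        prior = p_c if i == query_index else p_a
        log_terms.append(math.log(prior) + lbf)

    # Normalise with log-sum-exp for numerical stability.
    m = max(log_terms)
    log_total = m + math.log(sum(math.exp(t - m) for t in log_terms))
    post = [math.exp(t - log_total) for t in log_terms]

    pp_hn = post[0]
    pp_hc = post[1 + query_index]
    pp_ha = sum(p for i, p in enumerate(post[1:]) if i != query_index)
    return {"Hn": pp_hn, "Ha": pp_ha, "Hc": pp_hc}


# Uninformative data (all BF = 1) simply return the priors.
flat = posterior_hypotheses([0.0, 0.0, 0.0], query_index=1, p_a=0.01, p_c=0.1)
# A strong signal at the query variant moves most of the mass to Hc.
peak = posterior_hypotheses([0.0, 12.0, 0.0], query_index=1, p_a=0.01, p_c=0.1)
```

With uninformative data the posterior equals the prior, which is a convenient sanity check on any implementation of this calculation.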
The restriction to a single causal variant allows us to count the possible configurations (Q+1). If the assumption is deemed valid, CoPheScan can be run directly on summary GWAS data using Wakefield's method [12] to compute approximate Bayes factors, which summarise the relative support for a model in which the SNP is associated with a trait compared to the null model of no association. However, this assumption is not broadly valid. An alternative is to use the Sum of Single Effects (SuSiE) Bayesian fine-mapping regression framework [13,14] to partition the evidence into configurations corresponding to each of multiple possible causal variants, and to use these in a manner similar to that used to allow for multiple causal variants in coloc [10]. The SuSiE approach works best with either raw genotype data or summary GWAS data when in-sample LD information is available [15].
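The approximate Bayes factor referred to above can be computed from a SNP's effect estimate and standard error alone. A minimal sketch of Wakefield's formula follows; the function name is hypothetical, and the prior standard deviation of 0.2 on the effect size is an illustrative assumption rather than a value taken from this text.

```python
import math


def wakefield_log_abf(beta, se, prior_sd=0.2):
    """Log approximate Bayes factor for association versus no association.

    beta     : estimated effect size (e.g. log odds ratio)
    se       : standard error of beta
    prior_sd : assumed standard deviation of the normal prior on the true effect
    """
    z = beta / se
    v = se ** 2          # sampling variance of the estimate
    w = prior_sd ** 2    # prior variance of the true effect
    r = w / (w + v)      # shrinkage factor
    # ABF = sqrt(1 - r) * exp(r * z^2 / 2), returned on the log scale
    return 0.5 * (math.log(1.0 - r) + r * z * z)


null_like = wakefield_log_abf(beta=0.0, se=0.02)   # negative: favours the null
strong = wakefield_log_abf(beta=0.10, se=0.01)     # large and positive
```

Working on the log scale avoids overflow for strongly associated SNPs, whose Bayes factors can exceed the range of double precision.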
Hence, CoPheScan has the flexibility to be run in several ways (Figure 1b) depending on: (i) the assumption about the number of causal variants, (ii) the specification of either fixed or hierarchical priors, and (iii) the inclusion or exclusion of covariates if the hierarchical model is used to infer priors. A detailed description of the CoPheScan method is available in the Supplementary methods. A summary of the simulated data and of the variant and phenotype sources used for the real-data analysis can be found in Figure 1c, while a detailed description is provided in the Methods.
Figure 1 caption (continued): The inputs are GWAS summary statistics from multiple traits and the position of the query variant. Posterior probabilities of the three hypotheses are computed with priors and Bayes factors obtained using the different CoPheScan approaches. (c) Study design for evaluation. Simulated data: generated using SimGWAS; all CoPheScan approaches were run on this set. Real data: phenotypes tested were obtained from UK Biobank, and variants from fine-mapping FinnGen and a proteome dataset [16]. Hierarchical priors and SuSiE BF were used on the real data to identify SNP-disease associations. (QV, query variant; QT, query trait)

Simulations show CoPheScan is more accurate than a standard method which does not account for LD confounding
We simulated regional GWAS summary data for traits with zero, one or two causal variants (Methods) such that they corresponded to the three CoPheScan hypotheses. We also allowed the probability of Hc to vary according to a simulated genetic covariance between primary and query traits and considered whether including this information in the analysis increased inferential accuracy. We analysed the same data in parallel using a conventional PheWAS approach of testing each of the set of query SNPs for association, controlling either the FDR via the Benjamini-Hochberg (BH) procedure or the family-wise error rate via Bonferroni correction. We compared these to the results from CoPheScan, using a hierarchical model (with and without the covariate data) or fixed priors, chosen as described in the Supplementary methods, which broadly matched the proportion of Hn, Ha and Hc in the sample.
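For reference, the two conventional multiple-testing corrections used as comparators can be sketched as below. This is a generic illustration of the Bonferroni and Benjamini-Hochberg procedures applied to query-variant p-values, with hypothetical function names and made-up p-values; it is not the code used for the simulations.

```python
def bonferroni_calls(pvalues, alpha=0.05):
    """Call a test significant if p <= alpha / m (family-wise error control)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg_calls(pvalues, alpha=0.05):
    """Call the k smallest p-values, where k is the largest rank with
    p_(k) <= (k / m) * alpha (false discovery rate control)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            k_max = rank
    calls = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            calls[i] = True
    return calls


pvals = [0.5, 0.03, 0.01, 0.02]          # illustrative values only
bonf = bonferroni_calls(pvals)           # only p = 0.01 passes 0.05 / 4
bh = benjamini_hochberg_calls(pvals)     # the three smallest p-values pass
```

Both procedures consider only the p-value at the query variant, which is why neither can distinguish a causal query variant from one that is merely in LD with a causal neighbour.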
First, we considered the appropriate threshold on the posterior probability of Hc, ppHc, to call an association. We estimated the FDR internally as 1 - mean(ppHc | ppHc > t) for different values of the threshold t (Supplementary Figure 3). We found that ppHc > 0.6 maintained an FDR < 0.05 across all analyses of simulated data. Using this threshold, CoPheScan appeared less sensitive to the presence of a single causal variant (true Hc) than the conventional BH approach but more sensitive than the Bonferroni approach (Figure 2). CoPheScan demonstrated control of the FDR, estimated as the proportion of significant calls that were truly Hn or Ha (traits where the query variant was not causal): 0.026-0.039 for the different CoPheScan approaches, compared to 0.219 and 0.308 for the conventional BH and Bonferroni approaches respectively (Supplementary Table 1). The majority of these false positives were true Ha called as associated due to LD confounding. All CoPheScan approaches performed well in the case of a single causal variant, but when there were two causal variants (true Hc2), using SuSiE resulted in approximately 30% higher sensitivity to correct Hc predictions than the approximate Bayes factor (ABF) approach (Figure 2). This was balanced against marginally lower (<0.5%) sensitivity to Hc with SuSiE when traits truly had only a single causal variant (true Hc), compared to the CoPheScan approaches that assumed a single causal variant. Although the effect of including covariate information was minor overall, Figure 3a shows that it had a substantial effect in a minority of cases, bringing ppHc from below to above 0.6 in 2.79% of true Hc and Hc2 cases (80/2867), although also in 0.088% of true Hn and 0.011% of true Ha and Ha2.
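The internal FDR estimate described above is simple to compute from a vector of posterior probabilities. A minimal sketch, with a hypothetical function name and made-up ppHc values:

```python
def estimated_fdr(pp_hc, threshold):
    """Internal FDR estimate: 1 - mean(ppHc | ppHc > threshold).

    Returns None when no association is called at this threshold.
    """
    called = [p for p in pp_hc if p > threshold]
    if not called:
        return None
    return 1.0 - sum(called) / len(called)


pp = [0.95, 0.90, 0.70, 0.50, 0.10]      # illustrative values only
fdr_at_06 = estimated_fdr(pp, 0.6)       # calls 0.95, 0.90 and 0.70
```

Each called association contributes its own posterior probability of being a false call (1 - ppHc), so the estimate is the average of these over the called set.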
Finally, these initial simulations showed that the hierarchical model recovered very similar results to the fixed prior model, where we chose our fixed prior values to broadly match the simulation scenarios, i.e. an optimal scenario. This offers reassurance that the hierarchical model can perform just as well as a method that "knows" the correct prior values. However, in real data we will not know the true proportions of Hn, Hc or Ha, so we explored the robustness of both approaches to variations in these proportions. We found that using over-optimistic fixed priors, i.e. when the prior probability for Hc (P(Hc) = 0.091) exceeded the proportion of Hc in our data, led to a dramatically inflated FDR, whilst the hierarchical model correctly adapted to the different datasets so that the FDR was controlled except at the very lowest true proportions of Hc (Figure 3b).

Using genetic correlation as a covariate increases detection of associations with disease-causal variants
We explored the performance of CoPheScan (Supplementary Figure 4) by performing PheWAS with three sets of query variants in up to 2275 query traits from the UK Biobank (UKBB) summary data provided by the Neale Lab (http://www.nealelab.is/uk-biobank/). First, 136 disease-causal variants were identified as single-variant credible sets in fine-mapping data from FinnGen disease endpoints (primary traits, https://www.finngen.fi/en/access_results). We identified causal associations in UKBB at 43 (31.62%) of these, predominantly amongst query traits identical or related to the primary trait. Out of 101 unique query variant-primary trait pairs with exactly matched query variant-query trait pairs in UKBB, 32 were found to be Hc (Supplementary Figure 7) and 65 Hn due to a lack of power in UKBB (p-value > 10^-5). Four cases were called Ha; in these, the UKBB p-value was small but fine-mapping produced different results in UKBB and FinnGen (Supplementary Figure 7).
Genetic correlation (rg) information was available for only 1582 of the 2275 traits used in the analysis without rg. The rg values between these 1582 query traits and 69 UKBB traits that were matched with the FinnGen primary traits were used as a covariate (130697 query trait-query variant pairs tested). Including rg in the hierarchical model made a larger difference here than in the simulated data, perhaps reflecting a stronger effect than we anticipated in our simulations. Overall, ppHc values increased for traits with higher rg with the primary traits and, conversely, decreased for traits with lower (negative) rg (Figure 4). Incorporating rg resulted in the identification of 19 additional associations (Supplementary Table 8). For example, the variants rs3217893_C>T and rs2476601_A>G, fine-mapped for type 2 diabetes and rheumatoid arthritis (RA) in FinnGen respectively, were found to be associated with the medications gliclazide, a sulfonylurea used in the treatment of type 2 diabetes, and prednisolone, a steroid that can be used to treat RA, only when the genetic correlation information was included.
Query variants were often associated with multiple UKBB traits (median 5) that reflected related diseases and medications (Supplementary Table 8). For instance, rs11591147_G>T, a missense variant of PCSK9 identified as a disease-causal variant in FinnGen for statin medication, was found to be associated with UKBB traits related to different statin medications along with several cardiovascular traits. Less commonly, we found evidence for causal association of variants with seemingly unrelated traits. For example, rs9349379_A>G, an intron variant and eQTL for PHACTR1 identified by fine-mapping the FinnGen primary trait triptan, a medication used to manage migraine, was found to be associated with several UKBB traits related to migraine, such as the phenotype itself, migraine medications such as sumatriptan, ibuprofen and paracetamol, and the presence of a family history. However, we also found associations with angina, myocardial infarction and ischaemic heart disease, with the migraine-protective allele acting as a risk factor for cardiovascular traits. This matches results from a Mendelian randomisation study of migraine and cardiovascular disease [17] but contrasts with observational studies in which migraine is considered positively associated with cardiovascular traits [18]. Such discrepancies between genetic and observational studies in other traits have often been resolved in favour of the genetic result, through the identification of some confounding factor that led the observational studies to report inverse relationships, and it has been suggested that certain non-triptan migraine therapies might act to increase cardiovascular risk [17]. However, this pleiotropy did not appear at another migraine-identified variant, rs11172113_T>C, an intronic variant of LRP1, which was fine-mapped for the same FinnGen primary trait of migraine and found to be independently associated with several migraine-related phenotypes in UKBB but not with any of the cardiovascular traits (Figure 5).
Other examples of pleiotropic variants include rs2476601, a non-synonymous variant in PTPN22 which we found to be causally associated with multiple autoimmune diseases and their treatments as well as skin cancer, with the autoimmune-protective allele increasing the risk of cancer (Supplementary Figure 8). We also found a complex set of associations with two variants in APOE, rs429358 and rs7412, that jointly define the three major structural isoforms of APOE [19]: ε4, ε3 and ε2 (Supplementary Figure 9). ε2 represents the TT haplotype at the rs429358 and rs7412 variants, ε3 is represented by TC and ε4 by the CC haplotype [20]. We found associations of the ε4 allele, with reference to the ε2/ε3 genotype, with increased risk of Alzheimer's disease, statin medication, angina and ischaemic heart disease. We also found a protective effect of ε4 compared to ε2/ε3 on traits related to a family history of diabetes and blood pressure, which correspond to similar traits found in FinnGen. The association with deep venous thrombosis appeared to reflect a protective effect of the ε3/ε4 genotype with reference to ε2, and might therefore implicate the ε2 allele.
These findings align with previous studies on disease associations with different APOE genotypes [21] and highlight the ability of SuSiE to map traits to distinct alleles in LD.

Individual variant analyses
CoPheScan can also be used to study single variants if sensible prior values can be supplied. We considered exemplar non-synonymous variants in two genes: TYK2, with established allelic heterogeneity and associations with multiple immune-mediated diseases, and SLC39A8, with established pleiotropic function. We ran CoPheScan with SuSiE BF and priors inferred from the disease-causal variant analysis above (pa ≈ 3.82e-5 and pc ≈ 1.82e-3), considering as query traits 2275 UKBB traits and 56 additional traits potentially related to either gene from the GWAS catalog (Supplementary Table 3).
TYK2, which encodes the tyrosine kinase 2 enzyme, has multiple missense variants that have been associated with a range of immune-mediated diseases (Supplementary Table 11). We considered four: rs35018800_G>A (MAF: 0.0082), rs34536443_G>C (MAF: 0.0465), rs12720356_A>C (MAF: 0.0979) and rs55882956_G>A (MAF: 0.0017). The two variants with the lowest MAF, rs35018800_G>A and rs55882956_G>A, showed no association with any trait. rs34536443_G>C was associated with 3 UKBB and 5 GWAS catalog traits, all immune-related and previously established associations, including psoriasis, RA, juvenile idiopathic arthritis (JIA), type 1 diabetes and hypothyroidism. The variant rs12720356_A>C was associated with ulcerative colitis, psoriasis, Crohn's disease, systemic lupus erythematosus (SLE) and RA traits from the GWAS catalog, but not with any of the UKBB traits (Figure 6).
The highly pleiotropic variant rs13107325_C>T of SLC39A8 (a solute-carrier family gene that encodes the ZIP8 protein) was associated with 14 UKBB and 3 GWAS catalog phenotypes, replicating several known associations [22] with hypertension, schizophrenia, Crohn's disease, urinary incontinence, musculoskeletal system-related traits such as osteoarthritis, and traits related to alcohol dependence.
We used this region to perform a sensitivity analysis, selecting four variants (rs6855246, rs35225200, rs35518360 and rs13135092) in LD with rs13107325_C>T (r2 = 0.816-0.943) and running CoPheScan as if each had been selected as the causal variant. This allows us to explore two related questions: to what extent can two causal variants in LD cause false positive findings, and to what extent might CoPheScan still detect an association if the "causal" variants supplied to it are not really causal but are in LD with the causal variant? We found that CoPheScan was indeed sensitive to this misspecification: of the 17 traits identified as causally associated with rs13107325, 4 had ppHc < 0.6 with rs13135092 (r2 = 0.943) and 11 with rs6855246 (r2 = 0.816). The results were increasingly discrepant as the r2 with rs13107325_C>T decreased (Figure 7 and Supplementary Figure 10). The group of traits with high ppHc across multiple variants tended to have larger minimum p-values in the region compared to those for which ppHc was low across multiple variants, suggesting that CoPheScan will be best at discriminating between potential causal variants in LD when the association signal in the query data is strong.
Finally, we sought to verify previously proposed causal associations between the HMGCR variant rs12916_T>C and metabolic traits. HMGCR encodes HMG-CoA reductase, which is targeted by statins to lower LDL cholesterol. Previously, HMGCR variants have been used as a proxy for statin effect to show a higher risk of type 2 diabetes and higher body mass index (BMI) in MR studies [23]. However, the validity of this has been challenged by evidence that there may be distinct causal variants underlying type 2 diabetes, BMI and HMGCR levels [24]. We performed CoPheScan analysis on the UKBB traits LDL, BMI, type 2 diabetes, waist circumference and weight. We identified a known causal association with LDL (ppHc = 1).
However, despite significant p-values at rs12916 for BMI, weight and waist circumference, CoPheScan consistently concluded that while the region contained a causal variant for each trait, it was not rs12916 (ppHa > 0.99). In fact, no credible sets were identified in the HMGCR gene region, and the SuSiE signals from these traits indicate the presence of an alternative causal POC5 variant (Supplementary Figure 11). This implies that genetic studies that demonstrated a relationship between statin therapy and BMI or type 2 diabetes using HMGCR variants as a proxy might be incorrect [24], as they studied the SNPs in isolation while ignoring their regional context [25]. CoPheScan is thus valuable in verifying assumptions in instrumental variable analyses.

PheWAS of protein-associated variants
One challenge of GWAS has been to link disease associations to their causal genes. PheWAS allows us to start with variants with a known causal effect on a protein and ask which diseases are also causally associated, exploiting the low false positive rate of CoPheScan. We began with 505 plasma protein QTLs (pQTLs) [16] identified as single-variant credible sets in fine-mapping of 527 plasma proteins. Nine variants were found to be associated with UKBB traits (Table 1 and Supplementary Table 9). Among the established associations, we found an association between a pQTL for APOC1 and high cholesterol, as well as with reported treatment with the cholesterol-lowering drug simvastatin. Both associations make sense given the known biology of APOC1, but only the first would have been detected by scanning for significant p-values, as the p-value for high cholesterol at this SNP (p = 6.19 × 10^-19) is much lower than that for simvastatin (p = 9.59 × 10^-4). This emphasises the value of exploiting the additional information that we believe the variant to have a causal effect on a measurable phenotype (Supplementary Figure 12).
We also found a novel association: rs214830_G>C, a pQTL for TGM3, was associated with skin cancer (ppHc = 0.75). TGM3 is required for skin development and is normally expressed in the spinous and granular layers of the epidermis. Its expression was found to be absent in melanoma and squamous cell carcinoma of the skin but strong in basal cell carcinoma (BCC), suggesting it could be a specific marker for BCC diagnosis [26]. Associations of variants in TGM3 with BCC have also been reported [27-29], but rs214830_G>C was not always the top variant, and GWAS associations can mark causal effects in neighbouring genes [30]. Our analysis suggests this association could be directly causal, with TGM3 involved in the development of BCC as well as acting as a biomarker.
Finally, we considered 3586 variants labelled as protein-truncating variants (PTVs) in the UKBB summary data with MAF > 0.001, consisting of those predicted by VEP to be stop_gained, frameshift, splice_acceptor or splice_donor. The fraction of query variants that were found to be causally associated with at least one trait in UKBB was much lower for PTVs (~0.31%) than for disease-causal variants identified in FinnGen (~40%) and pQTLs (~1.8%) (Table 2, Supplementary Figures 5 and 6).
Examination of the Markov chain Monte Carlo (MCMC) chains showed issues with mixing for the PTV example which were not seen with the other datasets (Supplementary Figure 4). When we examined the inferred priors (Supplementary Table 12) obtained from this model, we observed that the pc/pa ratio was ~1.02, indicating that the inferred pa and pc priors were almost the same. Our PTV set consisted of four VEP classes, but while the MAF distribution of the stop-gained PTVs was similar to that of missense variants, those of the other PTVs (frameshift, splice donor and splice acceptor) were similar to that of synonymous variants (Supplementary Figure 13a). As selection can constrain MAF, we hypothesised that the VEP stop_gained class might be more enriched for functional variation than the set of four classes we had used. We considered two ways to enrich the PTV set for functional variation: either using just the subset of stop-gained PTVs, or using the PTVs that were also defined as high-confidence homozygous predicted loss-of-function (pLoF) variants in gnomAD [31]. pLoF variants were predominantly rare, such that the pLoF subset of PTVs had a higher number of rare variants than the stop-gained subset (Supplementary Figure 13b).
We ran the hierarchical models for these two subsets of PTVs (Supplementary Figure 14).
Comparing the priors (Supplementary Table 12) across the different datasets tested, we observed that the ratio of prior probabilities for the query variant or a non-query variant to be causal, pc/pa (Table 2), obtained using the pLoF variants (2.59) was second only to that obtained using the FinnGen disease-related variants. The ratio from the stop-gained variant model (1.39) was similar to that from the pQTL variant model (1.28). This shows that sets of query variants with higher functional enrichment can be expected to have a higher pc/pa ratio.
Table 2 notes: QV, query variant; QT, query trait; N QV-QT, number of trait-variant associations. The number of QT was lower for the FinnGen/UKBB dataset in the 'with rg' case, as only traits with rg data available for the primary traits were retained (Methods).
In total, 26 associations were identified using all the PTVs. All 15 associations detected with the stop-gained PTVs and 7 from the pLoF subset overlapped with those from the whole set. Of the combined 26 PTV-trait associations (Supplementary Table 10), many corresponded to known effects. One example is rs2066847_G>GC, a NOD2 frameshift mutation that is reported as a pathogenic variant for inflammatory bowel disease in ClinVar and was associated with several phenotypes related to Crohn's disease and mouth ulcers in our analysis (Supplementary Figure 15). However, as seen with migraine and cardiovascular disease above, the association with mouth ulcers occurs in the opposite direction to the established comorbidity of Crohn's disease and mouth ulcers in the population, with the Crohn's disease risk allele appearing protective for mouth ulcers. Note that for the mouth ulcer trait, the effect sizes were also opposite at two other SNPs identified as a credible set in SuSiE analyses of both traits (Supplementary Figure 15).

Discussion
Detection of pleiotropic effects of genetic variants is an essential component of target discovery and drug repositioning. PheWAS typically takes information from marginal statistics at query variants in isolation from their neighbours, which can lead to false positives when multiple causal variants exist in LD. CoPheScan considers not only how small a p-value is at a given variant, but how small it is in comparison to its neighbours, and estimates how much upweighting should be applied due to the information that the variant is in a query variant set. In our simulations, CoPheScan showed considerably better control of false positive calls compared to a standard PheWAS approach, at the cost of lower sensitivity where multiple causal variants exist in a region. Whilst the higher false positive rate for standard PheWAS testing can be mitigated by the use of a second-stage analysis testing for colocalisation, that is not possible in the case of query SNPs selected for their known effects on a protein, such as the PTVs considered here.
CoPheScan learns how much to upweight query variants through the prior parameter pc, and the ratio of average pc to average pa is a useful measure of enrichment of causal variants for the set of query traits amongst the set of query variants. This measure can be used to assess the quality of any choice of variant set, with values close to 1 indicating a weak choice. It may vary considerably across query variant sets for the same set of query traits, as seen in the PTV analyses. However, while restriction to a smaller set of query variants with greater enrichment is likely to find a higher proportion of causal associations within the smaller set, this will not necessarily enhance discovery: whilst the majority of the discoveries found using the smaller, more enriched sets of PTVs were also found in the larger unfiltered set, this restriction also meant losing plausible discoveries that did not fall into either of the more restricted classes.
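The enrichment measure described above is a simple ratio of averaged priors. A minimal sketch, with a hypothetical function name; the only inputs taken from the text are the priors quoted for the disease-causal variant analysis (pa ≈ 3.82e-5 and pc ≈ 1.82e-3), treated here as single averaged values.

```python
def prior_enrichment_ratio(p_c_values, p_a_values):
    """Ratio of average pc to average pa across query variant/query trait pairs."""
    mean_pc = sum(p_c_values) / len(p_c_values)
    mean_pa = sum(p_a_values) / len(p_a_values)
    return mean_pc / mean_pa


# Priors quoted for the FinnGen disease-causal variant analysis.
finngen_ratio = prior_enrichment_ratio([1.82e-3], [3.82e-5])   # roughly 47.6
```

A ratio near 1, such as the ~1.02 reported for the unfiltered PTV set, means the model gives the query variant essentially no more prior weight than any neighbouring variant.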
We allow pc to vary between variants by exploiting additional external information in a regression framework. In our disease-variant focused analysis, we used the genetic correlation between index and query traits, but this could also be a categorical variable, such as the predicted deleteriousness of a missense variant, or the level of evidence for the functional effect of a PTV.
While the simulations emphasised the importance of learning pc in a hierarchical model for accurate inference, point estimates can be substituted if required. This borrowing of priors from a larger dataset is beneficial in scenarios where we might want to use CoPheScan to test associations between a small set of variants and phenotypes, as running a hierarchical model on limited data will not result in optimal prior estimates. However, we strongly advise careful consideration to ensure that the larger dataset in which the priors are learnt is a good match for the limited dataset under consideration.
One of the advantages of incorporating SuSiE in CoPheScan is the ability to detect allelic heterogeneity at a locus. We demonstrated this with two well-known distinct variants in the TYK2 gene which were associated with overlapping sets of immune-mediated disorders. This analysis also highlighted the importance of surveying disease-specific GWAS and not relying solely on biobanks, which may hold relatively low numbers of cases of any individual disease. For example, only three UKBB traits showed any association compared to seven of our curated immune-mediated disease GWAS, and while psoriasis in UKBB (4192 cases) was identified with one variant, psoriasis in Tsoi's GWAS (10558 cases) was identified with two. While biobanks remain incredibly useful for common traits such as cardiovascular and metabolic diseases, carefully curated bespoke GWAS of less common traits should be included in any PheWAS to complement the biobank resources and reveal the full spectrum of pleiotropy. This is particularly important because predicted beneficial effects of targeting a protein may be countered by on-target side effects on other traits, as we saw where the autoimmune-protective variant in PTPN22 was associated with an increased risk of skin cancer.
GWAS causal variants, even when identified with confidence, remain challenging to interpret, partly because it can be hard to link them with confidence to their causal genes. Protein-altering variants have thus become increasingly important because their effect on a gene is presumed known. The different relative enrichments in the different sets of PTVs we ran suggest that incorporating external evidence on the plausibility of a putative PTV having a functional effect will increase accuracy in PheWAS of these variants. However, as highlighted here, they often have very low minor allele frequencies. Thus, larger biobanks are still needed both for analysis of less common traits with common variants and for analysis of rare functional variants. It is thus encouraging that the UK Biobank and FinnGen studied here are complemented by the Japan Biobank [32], the Million Veteran Program [33] and the Uganda Genome Resource [34], which should allow CoPheScan, together with efforts at multiple-ancestry fine-mapping [35], to reveal more completely the pleiotropic spectrum of protein-altering genetic variation.
UKBB traits with pre-computed genetic correlation data. Thus, 75 variants from these matching traits were used for the hierarchical model using rg (Supplementary Tables 4 and 5).

pQTL variants
We downloaded summary data from a GWAS of plasma protein levels measured with 4,907 aptamers (corresponding to 4719 proteins) in 35,559 Icelanders from Ferkingstad et al. [16]. We fine-mapped the region around each signal under a single causal variant assumption. This is equivalent to taking only the first signal in a stepwise fine-mapping procedure. We made this conservative choice because we lacked access to an LD matrix for the Icelandic population, which makes it difficult to trust secondary signals found by stepwise regression or other multiple causal variant methods such as SuSiE. We obtained 505 SNPs associated with 527 proteins for testing associations with the UKBB phenotypes (Supplementary Table 6).
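Fine-mapping under a single causal variant assumption needs only per-SNP Bayes factors and no LD matrix. The sketch below shows one common way to do this: posterior inclusion probabilities are the normalised Bayes factors, and a credible set is the smallest group of SNPs whose probabilities reach a chosen coverage. The function name, the equal per-SNP prior and the 95% coverage are assumptions made for illustration and are not stated in the text.

```python
import math


def single_variant_credible_set(log_abf, coverage=0.95):
    """Posterior inclusion probabilities and credible set for one region,
    assuming exactly one causal variant and equal prior weight per SNP."""
    m = max(log_abf)
    weights = [math.exp(l - m) for l in log_abf]   # stabilised Bayes factors
    total = sum(weights)
    pip = [w / total for w in weights]

    # Add SNPs in decreasing order of probability until coverage is reached.
    order = sorted(range(len(pip)), key=lambda i: pip[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += pip[i]
        if cumulative >= coverage:
            break
    return pip, chosen


# One dominant SNP yields a credible set containing that SNP alone.
pip, cs = single_variant_credible_set([0.0, 10.0, 0.0])
# Indistinguishable SNPs all enter the credible set.
pip_flat, cs_flat = single_variant_credible_set([0.0, 0.0, 0.0, 0.0])
```

Under this sketch, a query variant would be retained when its credible set contains a single SNP, as in the first usage line.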

Query phenotypes
We used 2275 phenotypes from the UK Biobank (http://www.nealelab.is/uk-biobank). We obtained in-sample linkage disequilibrium matrices from https://broad-alkesgroup-ukbbld.s3.amazonaws.com/UKBB_LD [15]. We included all 2275 traits in the CoPheScan analysis of the FinnGen, pQTL and PTV variants. We downloaded genetic correlation [44] data between UK Biobank traits and disorders, estimated using LD score regression [45], from https://ukbb-rg.hail.is/. In the FinnGen/UKBB dataset, only 1582 of the 2275 traits had genetic correlation estimates with the UKBB traits mapped to the FinnGen primary traits, so only these traits were used for the hierarchical model that included rg. Additionally, we checked the allele counts (AC) of the
Hn: No association with the query trait (one configuration)
Ha: Causal association of a variant other than the query variant with the query trait (Q-1 configurations)
Hc: Causal association of the query variant with the query trait (one configuration)

Figure 1: Introduction and evaluation of the CoPheScan method. (a) Methodology: hypotheses with illustrations of the configurations of genetic variants within the genomic region and corresponding priors. (b) Schematic of the CoPheScan workflow.

Figure 2: Results for hypotheses discrimination in simulated data.

Figure 3: Effects of covariate inclusion and varying proportions of simulated hypotheses.

Figure 5: Causal associations of migraine-related variants